Skip to content

Restore SDID survey support for placebo and jackknife variance methods#365

Open
igerber wants to merge 15 commits intomainfrom
sdid-placebo-jackknife-survey
Open

Restore SDID survey support for placebo and jackknife variance methods#365
igerber wants to merge 15 commits intomainfrom
sdid-placebo-jackknife-survey

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 24, 2026

Summary

  • Closes the last SDID survey capability gap (TODO row 107). PR Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355 restored variance_method=\"bootstrap\" for strata/PSU/FPC via hybrid pairs-bootstrap + Rao-Wu + weighted-FW. This PR extends full-design support to variance_method=\"placebo\" and \"jackknife\", so all three SDID variance methods now handle both pweight-only and strata/PSU/FPC designs.
  • Placebo allocator: stratified permutation (Pesarin 2001). Each draw samples pseudo-treated indices uniformly within each stratum containing actual treated units; weighted-FW re-estimates ω and λ per draw with per-control survey weights threaded into both loss and regularization (reuses compute_sdid_unit_weights_survey / compute_time_weights_survey from PR Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355). Fit-time feasibility guards distinguish Case B (zero controls in a treated stratum) and Case C (fewer controls than treated in a treated stratum) with targeted ValueError messages — partial-permutation fallback rejected since it would silently change the null distribution.
  • Jackknife allocator: PSU-level leave-one-out with stratum aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)² with f_h = n_h_sampled / fpc[h] (population-count FPC form from survey.py:338-356). λ held fixed across LOOs; ω subsetted, composed with rw, renormalized. Strata with n_h < 2 silently skipped; total-zero-variance → NaN + UserWarning. Unstratified single-PSU short-circuits to NaN.
  • Intentional allocator asymmetry (documented in REGISTRY): placebo ignores the PSU axis (classical stratified unit-level permutation test); jackknife respects PSU (canonical survey jackknife). Both respect strata. Each is the defensible analog for its role (null-distribution test vs design-based variance approximation).
  • Gate relaxation: deletes the fit-time NotImplementedError guard at synthetic_did.py:352-369 that previously rejected placebo/jackknife + strata/PSU/FPC. Replicate-weight designs remain rejected (closed-form variance double-counts with Rao-Wu-like rescaling).
  • Coverage MC: benchmarks/data/sdid_coverage.json extended with jackknife on stratified_survey. Bootstrap validates near-nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife reported with anti-conservatism caveat — se_over_truesd ≈ 0.46 with only 2 PSUs per stratum is a well-documented limitation of the stratified jackknife formula (1 effective DoF per stratum). Placebo is structurally infeasible on the existing stratified_survey DGP (its cohort packs into one stratum with 0 never-treated units); the placebo survey path is exercised via unit tests on a feasible fixture.

Methodology references

  • Method name(s): SyntheticDiD._placebo_variance_se_survey (stratified permutation + weighted-FW), SyntheticDiD._jackknife_se_survey (PSU-level LOO with stratum aggregation).
  • Paper / source link(s):
    • Arkhangelsky, Athey, Hirshberg, Imbens & Wager (2021), American Economic Review 111(12): 4088-4118 — Algorithms 3 (jackknife) and 4 (placebo), extended here to survey settings.
    • Pesarin, F. (2001), Multivariate Permutation Tests, Ch. 3-4 — stratified permutation foundation.
    • Pesarin, F. & Salmaso, L. (2010), Permutation Tests for Complex Data, Wiley.
    • Rust, K. F. & Rao, J. N. K. (1996), "Variance Estimation for Complex Surveys Using Replication Techniques", Statistical Methods in Medical Research 5(3): 283-310 — stratified jackknife SE formula.
  • Any intentional deviations from the source (and why):
    • Allocator asymmetry (placebo ignores PSU axis, jackknife respects it): placebo is a null-distribution test — unit-level within-stratum permutation is classical (Pesarin 2001); PSU-level permutation on few PSUs is near-degenerate. Jackknife is a design-based variance approximation — PSU-level LOO within strata is the canonical form (Rust & Rao 1996). Both respect strata. Documented in REGISTRY §SyntheticDiD.
    • λ held fixed across jackknife LOOs: matches Arkhangelsky Algorithm 3 (non-survey). Re-estimating λ per LOO would conflate weight-estimation uncertainty (bootstrap's domain) with sampling uncertainty (jackknife's domain).
    • Jackknife anti-conservatism with few PSUs per stratum: documented in REGISTRY as a methodology limitation, not a bug. Users needing tight SE calibration with few PSUs should prefer variance_method=\"bootstrap\".

Validation

  • Tests added/updated (tests/test_survey_phase5.py):
    • Flipped test_full_design_placebo_raisestest_full_design_placebo_succeeds + same for jackknife (assert finite SE > 0, populated survey_metadata, .summary() round-trip).
    • New class TestSDIDSurveyPlaceboFullDesign: pseudo-treated-stratum contract (monkeypatched recorder), Case B / Case C front-door guards with targeted-message regression, SE-differs-from-pweight-only, deterministic dispatch.
    • New class TestSDIDSurveyJackknifeFullDesign: FPC magnitude regression (2-stratum handcrafted panel asserts SE_fpc == SE_nofpc · sqrt(1 - f) at rtol=1e-10 with f=0.5), stratum-aggregation self-consistency, SE-differs-from-pweight-only, single-PSU-stratum silently skipped, unstratified short-circuit returns NaN, all-strata-skipped warning + NaN, deterministic dispatch.
    • tests/test_methodology_sdid.py::TestCoverageMCArtifact narrative + assertions updated.
    • Bit-identity preserved for all pre-existing pweight-only and non-survey placebo/jackknife tests (32 in TestBootstrapSE + TestPlaceboSE + TestJackknifeSE, 19 in TestSyntheticDiDSurvey) by gating the new code path on _full_design_survey.
  • Backtest / simulation / notebook evidence:
    • Full coverage MC regenerated at n_seeds=500, n_bootstrap=200. Bootstrap + stratified_survey: α=0.05 rejection = 0.058, SE/trueSD = 1.13 (unchanged from PR Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355). Jackknife + stratified_survey: α=0.05 rejection = 0.45, SE/trueSD = 0.46 (documented anti-conservatism caveat).
    • Pilot MC at n_seeds=100 confirmed that placebo is structurally infeasible on this DGP; stratified_survey_design_factory now omits placebo with documented rationale.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

@github-actions
Copy link
Copy Markdown

Overall Assessment

Blocker — the new full-design SDID jackknife path can report a finite SE even when a required PSU delete-one replicate is undefined, so the returned survey jackknife variance no longer matches the Rust–Rao formula the PR cites.

Executive Summary

  • Affected methods: SyntheticDiD._placebo_variance_se_survey and SyntheticDiD._jackknife_se_survey, plus the jackknife diagnostics exposed through SyntheticDiDResults.get_loo_effects_df().
  • The new placebo branch looks methodologically coherent from the diff: the stratum-restricted allocator and Case B/Case C front-door guards match the new REGISTRY note.
  • The new survey jackknife branch has a blocker: it silently skips PSU deletions that remove all treated units, but still computes a finite “Rust & Rao” SE on the remaining replicates.
  • The survey jackknife results payload is inconsistent with the existing per-unit LOO contract, so get_loo_effects_df() cannot be correct on the new path.
  • The PR also removes unrelated public APIs (profile_panel, PanelProfile, Alert, and get_llm_guide("autonomous")) in a survey-methodology PR; that is a separate breaking change.

Methodology

  • Severity: P0 Location: diff_diff/synthetic_did.py:L2208-L2268, docs/methodology/REGISTRY.md:L1594-L1604, tests/test_survey_phase5.py:L587-L598, tests/test_survey_phase5.py:L836-L860_jackknife_se_survey() skips any PSU deletion that leaves no treated units (if not treated_kept_mask.any(): continue), but still accumulates variance with the full-stratum Rust–Rao factor (n_h - 1) / n_h. The main new fixture puts all treated units in one PSU, so one required delete-one replicate is always missing while the test still accepts a finite SE. Impact: silent incorrect SE / p-value / CI on a supported full-design path, because the code is no longer computing the stated Σ_{j∈h} jackknife over all sampled PSUs. Concrete fix: either reject / return NaN whenever any contributing stratum has an undefined PSU-deletion replicate, or derive and document a different estimator that handles missing replicates and scales them consistently; then add a regression where all treated mass sits in one PSU.

Code Quality

  • Severity: P1 Location: diff_diff/synthetic_did.py:L883-L925, diff_diff/results.py:L1102-L1152 — the new survey jackknife returns a flat array of successful PSU-level delete-one estimates, but fit() still attaches _loo_unit_ids = control_units + treated_units, and get_loo_effects_df() still assumes one estimate per unit. Impact: on full-design survey fits the accessor either raises on length mismatch or mislabels PSU-level replicates as unit-level effects. Concrete fix: either disable get_loo_effects_df() for survey jackknife results, or store PSU-level metadata and expose a separate PSU-level accessor with matching semantics.

Performance

  • No material performance finding from the changed variance code. The new placebo path does extra weighted-FW work per draw, but that matches the documented methodology choice.

Maintainability

  • Severity: P1 Location: diff_diff/__init__.py:L7-L20,L252-L253,L497-L500, diff_diff/_guides_api.py:L7-L29, plus deletion of diff_diff/profile.py and diff_diff/guides/llms-autonomous.txt`` — the PR removes public, documented APIs unrelated to SDID survey variance: profile_panel, PanelProfile, Alert, and get_llm_guide("autonomous")`. Impact: downstream imports and runtime calls break in a PR whose stated purpose is restoring SDID survey support. Concrete fix: keep those APIs in this PR, or move their removal to a separate deprecation PR with compatibility shims and release notes.

Tech Debt

  • No mitigated debt item changes the assessment. Removing the old TODO row only closes the previous “capability gap”; it does not mitigate the new jackknife correctness issue above.

Security

  • No security finding in the changed code or docs.

Documentation/Tests

  • Severity: P2 Location: tests/test_survey_phase5.py:L836-L860test_jackknife_full_design_stratum_aggregation_self_consistency claims to recompute the stratum formula, but the body only checks that result.se is finite and positive. Impact: the blocker above can pass unchanged. Concrete fix: make this test actually reconstruct the expected SE from explicit replicate bookkeeping, or change it to assert NaN / error on undefined jackknife designs after the methodology fix.
  • Severity: P3 Location: docs/methodology/REGISTRY.md:L1569-L1569 — the “default variance_method deviation from R” note still says placebo is preferred because full survey support exists only for bootstrap. Impact: stale methodology docs. Concrete fix: update that note to reflect the new placebo / jackknife survey support.

Path to Approval

  1. Make the survey jackknife refuse or NaN out designs where a contributing PSU delete-one replicate is undefined, instead of silently skipping it and reporting a finite Rust–Rao SE.
  2. Fix the jackknife diagnostics contract for survey fits: either provide PSU-level replicate metadata and a PSU-level accessor, or block get_loo_effects_df() on survey jackknife results.
  3. Remove the unrelated profile_panel / "autonomous" API deletions from this PR, or ship them separately with a compatibility/deprecation path.
  4. Replace the current “self-consistency” jackknife test with one that actually verifies the estimator or the new rejection behavior.

igerber and others added 3 commits April 24, 2026 17:38
Closes the last SDID survey gap (TODO.md row 107). PR #355 restored
variance_method="bootstrap" for strata/PSU/FPC via hybrid pairs-
bootstrap + Rao-Wu + weighted-FW. This commit extends the same full-
design capability to variance_method="placebo" and "jackknife".

Placebo allocator — stratified permutation (Pesarin 2001).
Pseudo-treated indices drawn within each stratum containing actual
treated units; weighted-FW re-estimates ω and λ per draw with per-
control survey weights threaded into both loss and regularization
(reuses compute_sdid_unit_weights_survey + compute_time_weights_survey
from PR #355). New private method _placebo_variance_se_survey.

Fit-time front-door guards (per feedback_front_door_over_retry_swallow.md)
distinguish two infeasible permutation configurations with targeted
ValueError messages: Case B (stratum with treated units has zero
controls) and Case C (stratum with treated units has fewer controls
than treated). Partial-permutation fallback rejected — it silently
changes the null-distribution semantics.

Jackknife allocator — PSU-level leave-one-out with stratum
aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h·
Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)². FPC form: f_h = n_h_sampled / fpc[h]
(population-count form from survey.py::SurveyDesign.resolve;
confirmed via survey.py:338-356 where fpc_h < n_psu_h is the
validation constraint). λ held fixed across LOOs; ω subset + rw-
composed-renormalized (matches Arkhangelsky Algorithm 3 non-survey
semantics — jackknife is variance-approximation, not refit-variance).
Strata with n_h < 2 skip silently; total-zero-variance → NaN +
UserWarning. Unstratified designs with PSU treated as single-stratum
JK1. New private method _jackknife_se_survey.

Gate relaxation — deletes the placebo+jackknife+strata/PSU/FPC raise
at synthetic_did.py:352-369. Replicate-weight gate at L329-337
unchanged (separate methodology; closed-form replicate variance
double-counts with Rao-Wu-like rescaling). fit() dispatcher adds
_placebo_use_survey_path / _jackknife_use_survey_path flags routing
to the new methods when appropriate; non-survey and pweight-only
paths bit-identical by construction (guarded by the same branch
isolation pattern used in PR #355 _bootstrap_se).

Allocator asymmetry — placebo ignores PSU axis; jackknife respects
it. Intentional: placebo is a null-distribution test (stratified unit-
level permutation is classical — PSU-level permutation on few PSUs is
near-degenerate), while jackknife is a design-based variance
approximation (PSU-level LOO is canonical per Rust & Rao). Both
respect strata. Rationale documented in method docstrings and
REGISTRY (follow-up commit).

Tests — tests/test_survey_phase5.py:
- TestSyntheticDiDSurvey: flip test_full_design_placebo_raises and
  test_full_design_jackknife_raises from NotImplementedError→succeeds;
  assert finite SE > 0, populated survey_metadata, .summary() round-trip.
- TestSDIDSurveyPlaceboFullDesign (new class): pseudo-treated-stays-
  within-treated-strata (monkeypatched recorder), Case B / Case C
  front-door guards (targeted ValueError match), se-differs-from-
  pweight-only, deterministic dispatch.
- TestSDIDSurveyJackknifeFullDesign (new class): stratum-aggregation
  self-consistency, fpc-reduces-se magnitude (SE_fpc = SE_nofpc/sqrt(2)
  at f=0.5, rtol=1e-10), se-differs-from-pweight-only, single-PSU-
  stratum silently skipped, unstratified short-circuit, all-strata-
  skipped UserWarning + NaN, deterministic dispatch.

Non-survey and pweight-only regressions — all 32 tests in
TestBootstrapSE + TestPlaceboSE + TestJackknifeSE pass unchanged;
bit-identity preserved by the new-path-gating pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…placebo, jackknife)

Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.

Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
  returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
  omitted because the DGP's cohort packs into a single stratum with 0
  never-treated units — stratified-permutation placebo is structurally
  infeasible on this DGP (raises Case C at fit-time). Module docstring
  explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
  block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
  nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
  reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
  conservatism from the stratified jackknife formula with 2 PSUs per
  stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).

REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
  support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
  - "Note (survey + placebo composition)": stratified-permutation
    allocator, weighted-FW refit, ω_eff composition, fit-time
    feasibility guards (Case B / Case C), scope note on what is NOT
    randomized (within-stratum PSU axis). Cites Pesarin (2001) /
    Pesarin & Salmaso (2010).
  - "Note (survey + jackknife composition)": PSU-level LOO algorithm,
    explicit stratum-aggregation SE² formula, FPC handling (population-
    count form from survey.py:338-356), fixed-weights rationale,
    degenerate-LOO skip semantics, scope note, known anti-conservatism
    with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
  documents the intentional asymmetry (placebo ignores PSU, jackknife
  respects it) with rationale rooted in each method's role (null-
  distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
  anti-conservatism narrative; placebo row explicitly marked N/A-on-
  this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
  support for placebo and jackknife.

Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
  stratified-permutation placebo allocator and the PSU-level LOO
  jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
  row updated from "bootstrap only (PR #352)" to "Full (all three
  variance methods)"; legend amended; "Note on SyntheticDiD" block
  rewritten to describe all three allocators with the jackknife
  few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
  jackknife gap; Phase 6 bullet updated to describe all three
  allocators; Current Limitations table entry removed (only replicate-
  weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
  support (no new section header — folded into existing Unreleased
  block); "### Changed (PR #355)" tweaked to note the separate
  follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
  parameter description rewritten to describe all three methods.
  Placebo fallback-guidance comment updated to remove stale "placebo
  and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
  to describe all three survey allocators (UTF-8 fingerprint
  preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
  and assertions updated to reflect that placebo=0-fits is expected
  structurally on stratified_survey (documented Case C), while
  jackknife now runs successfully with the known anti-conservatism
  caveat intentionally unasserted at the calibration-gate level.

Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
  tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
  tests/test_survey_phase5.py::TestSyntheticDiDSurvey
  tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
  tests/test_guides.py → 82 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck get_loo_effects_df on survey jackknife

P0 (Methodology — survey jackknife silently skipping undefined LOO):
The Rust & Rao (1996) stratified jackknife formula `SE² =
Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)²` requires every
PSU-LOO `τ̂_{(h,j)}` to be defined. The previous implementation
silently skipped PSUs whose deletion removed all treated units (or
zeroed control ω_eff mass, or raised in the estimator) while still
applying the full `(n_h-1)/n_h` factor, under-scaling variance on
designs where treated units pack into a single PSU.

Fix: `_jackknife_se_survey` now tracks any undefined replicate in a
contributing stratum (n_h ≥ 2) and short-circuits to `SE=NaN` with a
targeted `UserWarning` naming the stratum / PSU / reason (deletion
removes all treated, kept ω_eff zero, kept treated survey mass zero,
estimator raised, estimator returned non-finite). Partial LOOs are
still returned in `placebo_effects` for debugging; users needing a
variance estimator that accommodates PSU-deletion infeasibility
should use `variance_method="bootstrap"`. Silent stratum-level skip
for `n_h < 2` is preserved (canonical lonely-PSU handling matching
R `survey::svyjkn`).

New regression `test_jackknife_full_design_undefined_replicate_returns_nan`
exercises the fix on the original `sdid_survey_data_full_design`
fixture (treated all in stratum 0 PSU 0 → LOO PSU 0 removes all
treated) and asserts both the `UserWarning` match and `np.isnan(se)`.

The existing jackknife tests that asserted finite SE now use a new
`sdid_survey_data_jk_well_formed` fixture where treated units are
spread across two PSUs within stratum 0 (so every LOO leaves ≥1
treated). The self-consistency test
(`test_jackknife_full_design_stratum_aggregation_formula_magnitude`)
was rewritten from a flaccid finite-positive check to a real
recomputation of the Rust & Rao formula on the returned 6-entry
`placebo_effects` array, asserting `result.se == pytest.approx(
expected, rel=1e-12)`.

Coverage MC (`benchmarks/data/sdid_coverage.json`) is unchanged:
the `stratified_survey` DGP spreads its 32 treated units across
PSUs 2 and 3 within stratum 1 and PSUs 0 and 1 within stratum 0,
so every LOO is defined there too. The previously-reported jackknife
anti-conservatism (α=0.05 rejection = 0.45, SE/trueSD = 0.46) is
the documented few-PSU limitation (1 effective DoF per stratum
with `n_h = 2`), not the P0 silent-skip bug.

P1 (Code Quality — get_loo_effects_df on survey jackknife):
`SyntheticDiDResults.get_loo_effects_df()` assumes a length-N
unit-indexed `placebo_effects` array (first n_control are control-
LOO, next n_treated are treated-LOO). Survey-jackknife fits return
a flat PSU-level replicate array of variable length; joining onto
the fit-time `control_unit_ids + treated_unit_ids` would mislabel
PSU replicates as unit-level effects.

Fix: `get_loo_effects_df()` now raises `NotImplementedError` with a
targeted message pointing to `result.placebo_effects` for the raw
PSU-level array and REGISTRY §SyntheticDiD "Note (survey + jackknife
composition)" for the aggregation formula. New regression
`test_get_loo_effects_df_raises_on_survey_jackknife` asserts the
raise on a survey fit. Non-survey and pweight-only jackknife fits
continue to use `get_loo_effects_df()` as before (unit-level LOO).

P3 (Documentation — stale default variance_method note):
`docs/methodology/REGISTRY.md:L1569` default-variance-method note
rewritten to reflect that all three variance methods now support
full survey designs (removing "full design supported on bootstrap
only" language) and to recommend bootstrap specifically on surveys
with few PSUs per stratum.

Branch also rebased onto current origin/main to pick up PR #356
(agent-profile-panel) and PR #361 — the R1 Maintainability finding
about "unrelated API deletions" was a stale-base-drift artifact
(my branch was created before #356 merged). After rebase the diff
against main shows only SDID-survey changes.

Verification
------------
pytest tests/test_survey_phase5.py
       tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
→ 87 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@igerber igerber force-pushed the sdid-placebo-jackknife-survey branch from c0e14c7 to c2b97e0 Compare April 24, 2026 21:48
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c2b97e060041accf825881fd7f61f7f3d1b9bd5a


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous blocker on silently skipping undefined survey jackknife replicates appears fixed.
  • The new survey jackknife accessor guard is too broad: it also blocks the older pweight-only survey jackknife path, which still has unit-level LOO output.
  • The placebo support matrix and REGISTRY now overstate support on PSU/FPC-only designs without strata; those fits still fall back to the old pweight-only allocator instead of the documented full-design weighted-FW path.
  • The very anti-conservative stratified_survey × jackknife coverage row is explicitly documented in REGISTRY.md as a few-PSU Rust-Rao limitation, so I am treating that as informational rather than a defect.
  • A few touched messages/docs remain stale after the re-review fixes.

Methodology

  • Severity: P1. Impact: SyntheticDiD.fit() now publicly documents that any full-design placebo fit (strata/PSU/FPC) uses stratified permutation plus weighted Frank-Wolfe, and REGISTRY.md says the same, but the actual dispatcher only enables _placebo_variance_se_survey() when strata is present; PSU/FPC-only designs fall through to _placebo_variance_se() instead (diff_diff/synthetic_did.py:L248-L272, diff_diff/synthetic_did.py:L790-L799, diff_diff/synthetic_did.py:L930-L959, docs/methodology/REGISTRY.md:L1564-L1588). That is an undocumented methodology mismatch on a newly supported surface. Concrete fix: either synthesize a single stratum and route PSU/FPC-only placebo fits through the survey allocator so they actually use the documented weighted-FW path, or explicitly narrow the public support matrix and add a REGISTRY.md note/deviation stating that the placebo survey upgrade only applies when strata is present.
  • Severity: P3-informational. Impact: the poor stratified_survey × jackknife calibration is already documented as a known few-PSU limitation of the Rust-Rao formula, not a newly introduced correctness bug (docs/methodology/REGISTRY.md:L1604-L1612, docs/methodology/REGISTRY.md:L1635-L1635). Concrete fix: none required.

Code Quality

  • Severity: P1. Impact: the new get_loo_effects_df() blocker keys off survey_metadata.n_psu, but pweight-only survey fits also populate n_psu via the library’s implicit-PSU metadata path, so the accessor now raises NotImplementedError even on the old unit-level jackknife implementation (diff_diff/results.py:L1144-L1158, diff_diff/survey.py:L749-L753, diff_diff/synthetic_did.py:L991-L993). This regresses an existing diagnostic surface on a previously supported path. Concrete fix: store an explicit LOO granularity flag from fit() (unit vs psu) and only block the PSU-level case; add a regression test for variance_method="jackknife" with SurveyDesign(weights="weight") asserting that get_loo_effects_df() still returns the unit-level DataFrame.

Performance

No material findings in the touched variance code.

Maintainability

No material maintainability finding beyond the documentation/message drift noted below.

Tech Debt

No blocker-level tech-debt finding. Removing the old SDID survey-gap row from TODO.md is appropriate, but it does not mitigate the live P1 issues above.

Security

No security findings in the touched code/docs/artifact updates.

Documentation/Tests

  • Severity: P3. Impact: several touched messages/docs still describe the wrong post-PR contract: _placebo_variance_se() still warns that jackknife is “pweight-only only” (diff_diff/synthetic_did.py:L1634-L1650), _jackknife_se_survey()’s docstring still says degenerate LOOs are “skipped” even though the implementation now makes SE undefined (diff_diff/synthetic_did.py:L2078-L2092), and the stratified_survey placebo narrative labels the zero-control infeasibility as Case C instead of Case B (benchmarks/python/coverage_sdid.py:L11-L18, benchmarks/python/coverage_sdid.py:L247-L252, docs/methodology/REGISTRY.md:L1637-L1637). Concrete fix: align these strings with the current behavior and the Case B/Case C definitions already stated in the REGISTRY.

Path to Approval

  1. Resolve the placebo methodology mismatch for PSU/FPC-only survey designs: either route them through a single-stratum weighted-FW survey placebo path, or explicitly narrow/document the supported contract in REGISTRY.md and the public fit() docstring. Add a regression test with SurveyDesign(weights="weight", psu="psu").
  2. Replace the survey_metadata.n_psu heuristic in get_loo_effects_df() with an explicit unit-vs-PSU replicate-level flag from fit(), and add a regression test proving that pweight-only survey jackknife results still expose unit-level LOO diagnostics.

… path; explicit LOO granularity flag

P1 (Methodology — PSU/FPC-only placebo mismatch with documented contract):
The dispatcher previously routed placebo to ``_placebo_variance_se_survey``
only when ``strata`` was present. PSU-only and FPC-only designs fell
through to the non-survey ``_placebo_variance_se`` path — silently
inconsistent with REGISTRY §SyntheticDiD "Note (survey + placebo
composition)" and the ``fit()`` docstring, which document the
weighted-FW stratified-permutation allocator for any full-design
survey (strata OR PSU OR FPC).

Fix: gate the placebo survey dispatch on ``_full_design_survey`` (the
same flag already used for bootstrap and jackknife). For
PSU/FPC-without-strata designs, ``fit()`` synthesizes a single
stratum (``_strata_control_eff = zeros(n_control)``,
``_strata_treated_eff = zeros(n_treated)``) so the stratified-
permutation allocator degenerates to a global within-stratum
permutation dispatched through the weighted-FW path. Jackknife
dispatch was already stratum-synthesizing; unified both methods on
the same ``_strata_*_eff`` arrays.

New regression ``test_placebo_full_design_psu_only_routes_through_survey_path``
monkeypatches both placebo methods with distinct sentinels and
asserts ``SurveyDesign(weights=..., psu=...)`` (no strata) dispatches
to the survey method on SE magnitude.

P1 (Code Quality — get_loo_effects_df over-broad block):
The R1 fix keyed the accessor guard off ``survey_metadata.n_psu is not
None``. But pweight-only survey fits populate ``n_psu`` too (via the
implicit-PSU metadata path in ``survey.py`` L749-L753); the guard
would false-positive and raise ``NotImplementedError`` on the
previously-supported unit-level LOO diagnostics.

Fix: add an explicit ``_loo_granularity`` attribute on
``SyntheticDiDResults`` set by ``fit()`` to ``"unit"`` (non-survey or
pweight-only jackknife — classical Algorithm 3 unit-level LOO),
``"psu"`` (full-design survey jackknife — PSU-level LOO), or
``None`` (non-jackknife variance methods). ``get_loo_effects_df()``
now keys the raise off ``_loo_granularity == "psu"`` rather than
``survey_metadata.n_psu``.

Two regression tests:
* ``test_get_loo_effects_df_raises_on_survey_jackknife`` — verifies
  ``_loo_granularity == "psu"`` on a full-design fit and that the
  accessor raises ``NotImplementedError`` with the PSU-level pointer
  message.
* ``test_get_loo_effects_df_works_on_pweight_only_jackknife`` —
  verifies ``_loo_granularity == "unit"`` on a pweight-only fit and
  that the accessor returns a unit-indexed DataFrame with the
  expected schema (columns ``unit``, ``role``, ``att_loo``,
  ``delta_from_full``; length ``n_control + n_treated``).

P3 (Documentation — stale messages after R1 fix):
* ``_placebo_variance_se``'s fallback warning (two sites) described
  jackknife as "pweight-only only" — no longer true after this PR.
  Rewrote to describe both bootstrap and jackknife as supporting
  full survey designs (with the jackknife few-PSU caveat pointing to
  REGISTRY).
* ``_jackknife_se_survey``'s docstring described "Degenerate LOOs
  are skipped per iteration" — stale after the R1 P0 fix switched to
  "undefined-LOO → SE=NaN + targeted UserWarning". Rewrote the
  bullet to describe the four undefined-replicate conditions and the
  NaN-return semantics, distinguishing them from the silent
  stratum-skip for ``n_h < 2`` (lonely-PSU case).
* ``coverage_sdid.py`` module docstring and ``REGISTRY.md`` placebo
  calibration-row narrative labeled the ``stratified_survey`` placebo
  infeasibility as "Case C" (fewer controls than treated). Correct
  label is **Case B** (zero controls in a treated-containing stratum)
  — the DGP packs all treated into stratum 1, which has 0 never-
  treated units.

Verification
------------
pytest tests/test_survey_phase5.py
       tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
→ 89 passed (2 new tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ddb77b2ac3ae92ff4467f8ba76d0efa5dd2c639a


Overall Assessment

Looks good

Executive Summary

  • The two prior P1s look resolved. Full-design placebo now routes any strata/PSU/FPC design through the survey allocator, including PSU/FPC-only designs via a synthesized single stratum (diff_diff/synthetic_did.py:L790-L958, tests/test_survey_phase5.py:L858-L890).
  • The get_loo_effects_df() regression is also fixed: fit() now stores an explicit _loo_granularity, and the accessor only blocks the PSU-level survey jackknife path, not pweight-only jackknife (diff_diff/synthetic_did.py:L1072-L1084, diff_diff/results.py:L1142-L1165, tests/test_survey_phase5.py:L1065-L1123).
  • Methodologically, the new survey jackknife/placebo paths stay aligned with the base SDID variance structure: Algorithm 3 is fixed-weight jackknife, Algorithm 4 is placebo reassignment over controls, and the survey-specific allocator asymmetry is explicitly documented in REGISTRY.md rather than introduced silently. (arxiv.org)
  • The earlier silent-under-scaling jackknife issue is fixed: undefined delete-one replicates in contributing strata now return SE=NaN with a warning instead of a finite SE (diff_diff/synthetic_did.py:L2218-L2408, tests/test_survey_phase5.py:L1125-L1154).
  • Remaining issues are documentation-only: a few touched docs/messages still describe SDID survey support as bootstrap-only, and two narratives still mislabel the infeasible placebo DGP as Case C instead of Case B.
  • I could not run the test suite in this environment because pytest and project deps such as numpy are unavailable.

Methodology

  • No unmitigated P0/P1 findings.
  • Severity: P3-informational. Impact: The very anti-conservative stratified_survey × jackknife row is explicitly documented in docs/methodology/REGISTRY.md:L1594-L1609 and docs/methodology/REGISTRY.md:L1635-L1637 as a few-PSU limitation of the Rust-Rao-style survey jackknife, so I am not treating it as a defect. Concrete fix: None required.

Code Quality

No material findings.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings. Removing the old TODO entry is consistent with the code/tests now present.

Security

No material findings.

Documentation/Tests

  • Severity: P3. Impact: Internal docs/messages still contradict the new support matrix. docs/methodology/survey-theory.md:L358-L368 still says SDID survey support is bootstrap-only, and diff_diff/synthetic_did.py:L306-L347 still documents/prints that full-design placebo/jackknife are unsupported or pweight-only. Concrete fix: update those strings so they say replicate-weight designs remain unsupported, but pweight-only and full strata/PSU/FPC designs are now supported for all three SDID variance methods.
  • Severity: P3. Impact: The stratified_survey placebo infeasibility is still mislabeled as Case C in benchmarks/python/coverage_sdid.py:L249-L255 and tests/test_methodology_sdid.py:L3516-L3521, while the implemented guard and REGISTRY define this zero-control condition as Case B. Concrete fix: rename those references to Case B so the narrative matches the code and REGISTRY.md.

… labels

R3 approved the PR with only P3 documentation nits remaining. Stragglers
fixed:

- ``diff_diff/synthetic_did.py`` ``fit()`` docstring Raises clause and
  replicate-weight-rejection error message still described the
  placebo/jackknife full-design paths as unsupported / pweight-only.
  Rewritten to describe the current contract (all three variance methods
  accept pweight-only and full strata/PSU/FPC; only replicate-weight
  designs remain rejected).
- ``docs/methodology/survey-theory.md`` §4.2a "Where the IF chain does
  not apply" still said SyntheticDiD survey support is bootstrap-only.
  Rewritten to describe all three survey allocators (bootstrap hybrid
  pairs-bootstrap + Rao-Wu + weighted FW; placebo stratified permutation
  + weighted FW; jackknife PSU-level LOO + Rust-Rao aggregation).
- ``benchmarks/python/coverage_sdid.py`` ``_stratified_survey_design``
  docstring and ``tests/test_methodology_sdid.py::TestCoverageMCArtifact``
  narrative labeled the ``stratified_survey`` placebo infeasibility as
  "Case C" (fewer controls than treated). Correct label is **Case B**
  (zero controls in a treated-containing stratum) — the DGP packs all
  treated units into stratum 1, which has 0 never-treated units.

Verification: 89 passed (no behavior change; docs/messages only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 473c6d7d20ab2511e131b0347ee07288b4242fed


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior substantive re-review issues appear resolved: full-design survey dispatch now covers PSU/FPC-only designs, and get_loo_effects_df() is blocked only for PSU-level survey jackknife fits.
  • The base SDID pieces still line up with the source paper: Algorithm 3 keeps ω, λ fixed during jackknife LOO, and Algorithm 4 builds placebo draws by sampling control units without replacement. The survey-specific PSU/strata adaptations are documented in docs/methodology/REGISTRY.md, so I am not treating allocator asymmetry or the documented few-PSU jackknife undercoverage as defects. (nber.org)
  • P1: the new survey-placebo feasibility guard is still too weak. It allows exact-count treated strata (n_controls_h == n_treated_h), which collapses the stratified placebo support to a single admissible allocation, but the code treats that case as supported and the new smoke test asserts a positive finite SE on exactly such a fixture (diff_diff/synthetic_did.py:L833-L861, tests/test_survey_phase5.py:L21-L58, tests/test_survey_phase5.py:L211-L238).
  • Remaining cleanup is documentation-only: docs/methodology/survey-theory.md still labels SyntheticDiD as “Bootstrap only,” and tests/test_methodology_sdid.py still comments that the survey DGP is bootstrap-only (docs/methodology/survey-theory.md:L689-L704, tests/test_methodology_sdid.py:L3500-L3504).
  • I could not run the test suite here because the environment is missing project deps (numpy).

Methodology

  • Severity: P1. Impact: diff_diff/synthetic_did.py:L833-L861 only rejects n_controls_h < n_treated_h, but Algorithm 4’s placebo construction samples treated-sized control sets without replacement. When a treated-containing stratum has exactly n_controls_h == n_treated_h, that stratum contributes no randomization; if every treated stratum is exact-count, the placebo null collapses to one allocation. The PR’s own “succeeds” smoke test uses exactly that layout (tests/test_survey_phase5.py:L21-L58, tests/test_survey_phase5.py:L211-L238), so the new support claim overstates what the allocator can meaningfully support. Concrete fix: tighten the front-door guard to require at least two distinct stratified placebo allocations overall (or at minimum at least one treated stratum with n_controls_h > n_treated_h), and return a targeted ValueError or NaN+warning for the degenerate case. Update the smoke test to use sdid_survey_data_full_design or explicitly assert the degenerate exact-count behavior instead. (nber.org)

Code Quality

No material findings.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings.

Security

No material findings.

Documentation/Tests

  • Severity: P3. Impact: a few touched narratives still contradict the shipped surface. docs/methodology/survey-theory.md:L703-L704 still says SyntheticDiD | Bootstrap only, and tests/test_methodology_sdid.py:L3500-L3504 still describes stratified_survey as bootstrap-only. Concrete fix: update those remaining references to match the new support matrix and current artifact expectations.

Path to Approval

  1. Add an explicit full-design placebo degeneracy check for exact-count treated strata / one-allocation support in diff_diff/synthetic_did.py.
  2. Replace tests/test_survey_phase5.py::test_full_design_placebo_succeeds with a truly non-degenerate fixture (sdid_survey_data_full_design) or assert the expected warning/NaN behavior on the exact-count fixture.
  3. Clean up the remaining bootstrap-only wording in docs/methodology/survey-theory.md and tests/test_methodology_sdid.py.

…ta; non-degenerate test fixture

P1 (Methodology — degenerate exact-count placebo strata):
The Case B / Case C front-door guards rejected ``n_c_h == 0`` and
``n_c_h < n_t_h`` respectively, but allowed ``n_c_h == n_t_h``. For
the stratified-permutation allocator, the per-stratum support is
``C(n_c_h, n_t_h)``: when every treated-containing stratum has
``n_c_h == n_t_h``, the only allocation is to pick all ``n_c_h``
controls as pseudo-treated on every draw. All placebo draws produce
the same pseudo-treated set, the placebo null collapses to a single
point, and SE equals FP noise (~1e-16) from the np.average call
order-dependence. A naïve ``result.se > 0`` check spuriously passes.

Concretely, ``sdid_survey_data`` (stratum 0: 5 treated + 5 controls,
stratum 1: 10 controls, 0 treated) would return SE ≈ 3.79e-16 from
placebo, and the R2/R3-era ``test_full_design_placebo_succeeds``
test was passing only because of that sub-ULP noise — the test
assertion ``result.se > 0`` is satisfied even when the semantic SE
is zero.

Fix: add a Case D fit-time guard that rejects the design when every
treated-containing stratum has exactly ``n_c_h == n_t_h``. At least
one treated stratum must have ``n_c_h > n_t_h`` for the overall
permutation support (``∏_h C(n_c_h, n_t_h)``) to be ≥2.
ValueError message enumerates the per-stratum (n_c, n_t) counts
and points to ``variance_method='bootstrap'`` as the unconstrained
alternative.

Test changes:
* ``test_full_design_placebo_succeeds`` switched from
  ``sdid_survey_data`` (degenerate exact-count) to
  ``sdid_survey_data_full_design`` (stratum 0: 5 treated + 10 controls
  → ``C(10, 5) = 252`` distinct allocations). Tightened the SE
  assertion from ``> 0`` to ``> 1e-6`` so future regressions back to
  sub-ULP-noise SE fail loudly.
* New ``test_placebo_full_design_raises_on_exact_count_stratum``
  asserts the Case D ValueError fires on the old
  ``sdid_survey_data`` fixture (the regression target that surfaced
  this issue).

P3 (Documentation — remaining bootstrap-only stragglers):
* ``docs/methodology/survey-theory.md`` §"Estimator survey variance
  dispatch" table row for SyntheticDiD still said "Bootstrap only".
  Updated to "Bootstrap / permutation / PSU-LOO" with a note that
  all three variance methods support full strata/PSU/FPC designs.
* ``tests/test_methodology_sdid.py::TestCoverageMCArtifact``
  comment described ``stratified_survey`` as "bootstrap-only —
  placebo and jackknife reject strata/PSU/FPC at fit-time". Updated
  to reflect current state: bootstrap is the validation gate,
  jackknife is reported with anti-conservatism caveat, placebo is
  skipped due to DGP-specific Case B (all-treated-stratum packs).

Verification: 90 passed (1 new Case D regression test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f039e2fcf9bedd209e77094e94e2e1904ae27e7f


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review issue on placebo degeneracy is resolved: SyntheticDiD.fit() now front-door rejects the exact-count survey placebo case before entering the retry loop, and the bootstrap-only documentation was updated to reflect full-design placebo/jackknife support.
  • The new allocator asymmetry is documented in docs/methodology/REGISTRY.md, so I am not treating placebo’s within-stratum unit permutation vs jackknife’s PSU-level LOO as a defect.
  • P1: diff_diff/synthetic_did.py:L2401-L2444 converts legitimate zero survey-jackknife variance into SE=NaN, which contradicts the documented Rust-Rao formula and the library’s broader zero-variance survey convention.
  • P1: diff_diff/synthetic_did.py:L941-L955 / L2277-L2281 ignore SurveyDesign.lonely_psu on the new full-design jackknife path, so lonely_psu='certainty' and 'adjust' silently behave like 'remove'.
  • P3: docs/methodology/REGISTRY.md:L1588-L1592 and L1647 still lag the shipped behavior: the new Case D placebo guard is undocumented there, and get_loo_effects_df() is no longer available for PSU-level survey jackknife.
  • I could not run the touched tests here because this environment is missing numpy and pytest.

Methodology

  • Severity: P1. Impact: diff_diff/synthetic_did.py:L2401-L2444 marks a stratum as contributing once all n_h delete-one replicates are defined, but then collapses any total_variance <= 0.0 into SE=NaN with the warning “every stratum was skipped.” Under the documented survey jackknife contract in docs/methodology/REGISTRY.md:L1594-L1606, that is wrong for legitimate zero-variance cases such as full-census FPC (f_h = 1) or exact-zero within-stratum dispersion: the Rust-Rao formula gives variance 0, not “undefined.” Concrete fix: return se = 0.0 whenever at least one stratum contributed and the summed Rust-Rao variance is exactly zero; reserve NaN for the already-documented undefined cases (all strata skipped, undefined replicate). Add a regression with the existing well-formed jackknife fixture and fpc = n_h to assert se == 0.0.
  • Severity: P1. Impact: the new survey jackknife path never receives the user’s lonely_psu setting (diff_diff/synthetic_did.py:L941-L955), and singleton strata are always skipped unconditionally (diff_diff/synthetic_did.py:L2277-L2281). That silently violates the general survey contract documented at docs/methodology/survey-theory.md:L526-L538 and docs/methodology/REGISTRY.md:L2874-L2879, where lonely_psu supports "remove", "certainty", and "adjust". On this path, "certainty" and "adjust" are effectively ignored, which can understate SE or flip what should be a zero-variance certainty case into NaN. Concrete fix: thread resolved_survey_unit.lonely_psu into _jackknife_se_survey() and implement the three singleton-stratum modes there, or explicitly reject lonely_psu != 'remove' until those modes are supported.

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings beyond the documentation drift noted below.

Tech Debt

  • No separate deferrable findings. The two P1 items above are correctness/methodology issues and should not be pushed to TODO.md.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: docs/methodology/REGISTRY.md:L1588-L1592 still documents only Case B/C placebo feasibility, but the code now also rejects the exact-count Case D degeneracy at diff_diff/synthetic_did.py:L865-L895. Separately, docs/methodology/REGISTRY.md:L1647 still describes get_loo_effects_df() as the generic jackknife diagnostic, but diff_diff/results.py:L1154-L1165 now raises NotImplementedError for PSU-level survey jackknife. Concrete fix: update those registry bullets so the docs match the shipped surface.

Path to Approval

  1. In _jackknife_se_survey(), distinguish “zero computed variance” from “undefined variance”: return se=0.0 for legitimate zero-variance Rust-Rao outcomes, and add a regression for full-census FPC (fpc = n_h) on the well-formed survey jackknife fixture.
  2. Honor SurveyDesign.lonely_psu on the survey jackknife path, or reject non-remove modes explicitly. Add tests covering a singleton-stratum design under lonely_psu='certainty' and lonely_psu='adjust'.

…; REGISTRY docs

P1 (Methodology — zero computed variance conflated with undefined):
``_jackknife_se_survey`` previously collapsed ``total_variance <= 0.0``
into ``SE=NaN`` with an "every stratum was skipped" warning. That is
correct for the "no stratum contributed" branch (undefined per Rust &
Rao) but wrong for legitimate zero-variance outcomes: full-census FPC
(``fpc[h] == n_h`` → ``f_h = 1`` → ``(1 - f_h) = 0`` zeros every
stratum contribution even when within-stratum dispersion is non-zero)
and exact-zero within-stratum dispersion both give
``total_variance = 0`` by construction, not by "undefined".

Fix: split the terminal branch. Return ``SE=NaN`` only when no stratum
contributed; otherwise return ``SE = sqrt(max(total_variance, 0.0))``.
The ``max(..., 0.0)`` protects against sub-FP-epsilon negatives and
preserves the legitimate zero case at bit precision.

New regression
``test_jackknife_full_design_full_census_fpc_returns_zero_se``:
fits on ``sdid_survey_data_jk_well_formed`` with ``fpc=3`` (n_h=3
per stratum → f_h=1 → zero SE by design). Asserts
``result.se == 0.0`` (not NaN).

P1 (Methodology — lonely_psu silently ignored on jackknife path):
The full-design jackknife always skipped singleton strata (``n_h <
2``) unconditionally, regardless of the user's
``SurveyDesign(lonely_psu=...)`` choice. ``"certainty"`` and
``"adjust"`` were silently degraded to ``"remove"``, which understates
SE when the user intended ``"certainty"`` (equivalent to skip on
jackknife) or flips what should be a zero-variance certainty case
into NaN otherwise.

Fix: validate ``resolved_survey_unit.lonely_psu`` at fit-time on the
survey jackknife path. ``"remove"`` and ``"certainty"`` are both
accepted (they produce the same SE on this path — singleton strata
contribute 0 variance under both, matching canonical Rust & Rao /
``survey::svyjkn`` behavior for JKn). ``"adjust"`` (R's overall-mean
fallback for singleton strata) is rejected with
``NotImplementedError`` and a targeted message pointing to bootstrap
as the unconstrained alternative.

Two regressions:
* ``test_jackknife_full_design_lonely_psu_adjust_raises`` —
  verifies the rejection message.
* ``test_jackknife_full_design_lonely_psu_certainty_equivalent_to_remove``
  — asserts ``SE_remove == SE_certainty`` at ``rel=1e-14`` on the
  well-formed fixture.

P3 (Documentation — REGISTRY lag):
* Placebo feasibility Notes documented Cases B and C but missed Case D
  (the exact-count degeneracy guard added in R4). Split the "Fit-time
  feasibility guards" paragraph into an explicit 3-case enumeration
  (B: zero-control-stratum; C: undersupplied stratum; D: all-exact-
  count strata → single allocation).
* ``get_loo_effects_df()`` description still said "Requires
  variance_method='jackknife'; raises ValueError otherwise." after R2
  taught it to also raise ``NotImplementedError`` on PSU-level survey
  jackknife. Rewrote to distinguish unit-level (available) vs PSU-
  level (blocked, with pointer to ``result.placebo_effects``).
* Added a Zero-variance-vs-undefined distinction paragraph and a
  "lonely_psu contract" paragraph to the jackknife survey Note,
  matching the shipped behavior from the two P1 fixes above.

Verification: 93 passed (3 new regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ffd2e50bc16baa05834bdb38cccb9d95a7aafb13


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior lonely_psu propagation issue is resolved: the full-design survey jackknife now explicitly rejects lonely_psu='adjust' and no longer silently treats it like "remove".
  • The prior registry drift on Case D placebo degeneracy and PSU-level get_loo_effects_df() is mostly resolved in REGISTRY.md.
  • P1: the survey jackknife zero-variance fix is incomplete. Full-census strata (f_h = 1) can still return SE=NaN if a delete-one replicate is undefined, even though Rust-Rao makes that stratum’s variance contribution exactly zero.
  • P3: several user-facing docs still lag the shipped behavior, including the jackknife zero-variance semantics and the PSU-level get_loo_effects_df() restriction.
  • I could not run the touched tests here because this environment is missing both numpy and pytest.

Methodology

  • Severity: P1. Impact: SyntheticDiD._jackknife_se_survey() still evaluates delete-one feasibility inside full-census strata and returns SE=NaN before it reaches the new zero-variance path. Concretely, f_h is computed first, but even when f_h == 1 the code enters the LOO loop, can hit the “deletion removes all treated units” / undefined-replicate branch, and exits with NaN. Under the documented Rust-Rao formula, a stratum with (1 - f_h) = 0 contributes zero variance regardless of replicate dispersion, so these fits should return SE=0.0, not NaN. This means the previous zero-variance fix is only partial. Concrete fix: short-circuit strata with f_h >= 1 before any delete-one feasibility checks, and add a regression that uses the existing treated-in-one-PSU fixture with fpc = n_h to assert SE=0.0 even when dropping that PSU would otherwise make the replicate undefined. diff_diff/synthetic_did.py:L2308-L2458, diff_diff/synthetic_did.py:L2471-L2479, docs/methodology/REGISTRY.md:L1609-L1615, tests/test_survey_phase5.py:L1164-L1203

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings beyond the documentation drift noted below.

Tech Debt

  • No separate deferrable findings. The P1 above is a variance-correctness issue and should not be pushed to TODO.md.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: the shipped docs still disagree on survey jackknife zero-variance semantics. CHANGELOG.md says “total-zero-variance → NaN + UserWarning,” but the code and registry now document legitimate zero variance as SE=0.0. Concrete fix: update the changelog entry to match the implemented/registered behavior. CHANGELOG.md:L43-L49, docs/methodology/REGISTRY.md:L1611-L1615
  • Severity: P3. Impact: the jackknife diagnostics docs are still stale outside the registry. llms-full.txt still presents get_loo_effects_df() as generic per-unit jackknife output, and the in-code docstring still describes only the unit-level join even though full-design survey jackknife now raises NotImplementedError for PSU-level replicates. High-level survey docs also read as unconditional full support without surfacing the lonely_psu='adjust' exception. Concrete fix: sync these summaries with the registry and the new results.py behavior. diff_diff/guides/llms-full.txt:L1033-L1035, diff_diff/results.py:L1108-L1134, docs/survey-roadmap.md:L47-L65, docs/methodology/survey-theory.md:L369-L377, docs/methodology/survey-theory.md:L767-L781

Path to Approval

  1. In _jackknife_se_survey(), treat full-census strata (f_h >= 1) as zero-contribution before any delete-one feasibility checks, then add a regression using the treated-in-one-PSU full-design fixture with fpc = n_h to verify the method returns SE=0.0 rather than NaN in that census edge case.

…sync

P1 (Methodology — full-census strata could still return NaN under
undefined-LOO):
The R5 zero-variance-vs-NaN fix correctly returned ``SE=0`` when
``total_variance == 0`` and at least one stratum contributed, but the
LOO feasibility loop still ran per-stratum regardless of ``f_h``. If a
full-census stratum (``f_h ≥ 1`` → ``(1 - f_h) ≤ 0`` zeros its
variance contribution) ALSO had an undefined delete-one replicate
(e.g., all treated in the dropped PSU), the code exited via the
undefined-replicate branch with ``SE=NaN`` — wrong, because the
stratum's contribution is mathematically zero regardless of
replicate feasibility.

Fix: short-circuit strata with ``f_h >= 1.0`` before the delete-one
feasibility loop. Mark as contributing (so ``any_stratum_contributed``
becomes True), skip LOO computation, continue to the next stratum.

New regression ``test_jackknife_full_design_full_census_short_circuits_undefined_loo``:
uses ``sdid_survey_data_full_design`` (all 5 treated in stratum 0
PSU 0 — LOO PSU 0 removes all treated, triggering the undefined-
replicate branch in non-full-census fits) with ``fpc = n_h = 3``
(full census) and asserts ``SE == 0.0``, not NaN.

P3 (Documentation — stale zero-variance + PSU-LOO wording):
* ``CHANGELOG.md`` still said "total-zero-variance → NaN + UserWarning"
  (R5-era). Rewrote to spell out the full contract: legitimate zero
  variance → ``SE=0``; undefined replicates / all-strata-skipped →
  ``SE=NaN`` + targeted warning; full-census short-circuit + lonely_psu
  ``"remove"``/``"certainty"`` acceptance vs ``"adjust"`` rejection.
  Also enumerated Case B/C/D on the placebo feasibility line.
* ``diff_diff/results.py::get_loo_effects_df`` docstring described
  only the unit-level unit-id-join behavior; after the R2 fix the
  accessor raises ``NotImplementedError`` on PSU-level survey
  jackknife. Rewrote docstring with explicit "Available on" /
  "Blocked on" sections pointing to ``result.placebo_effects`` for the
  raw PSU-level replicate array.
* ``diff_diff/guides/llms-full.txt`` ``get_loo_effects_df()`` bullet
  still described it as generic unit-level only; updated to call out
  the NotImplementedError on full-design survey jackknife (PSU-level
  replicates).
* ``docs/survey-roadmap.md`` Phase 5 SDID row and
  ``docs/methodology/survey-theory.md`` §4.2b PSU-level LOO bullet
  updated to surface (a) the ``lonely_psu="adjust"`` rejection, (b)
  the full-census short-circuit, (c) the Case D placebo guard, and
  (d) the zero-variance-vs-NaN distinction — all aligned with the
  shipped behavior and REGISTRY.

Verification: 115 passed (1 new full-census regression; all previously
passing tests plus guides unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6d27a573f99ee41643e2174a018ebbffb10376a5


Overall Assessment

Looks good

Executive Summary

  • The prior P1 is resolved: _jackknife_se_survey() now short-circuits full-census strata before any delete-one feasibility checks, so f_h >= 1 contributes zero instead of leaking to SE=NaN. diff_diff/synthetic_did.py:L2320-L2334
  • I did not find any new unmitigated P0/P1 issues in the new survey placebo/jackknife estimators.
  • The methodology changes are aligned with the updated registry: survey placebo is implemented as stratified within-stratum permutation plus weighted FW, and survey jackknife as PSU-level LOO with Rust-Rao stratum aggregation. diff_diff/synthetic_did.py:L1760-L1943, diff_diff/synthetic_did.py:L2137-L2495, docs/methodology/REGISTRY.md:L1582-L1621
  • The remaining issue I found is informational: full-design survey jackknife still populates _loo_unit_ids, so repo guidance keyed off that sentinel will try get_loo_effects_df() and hit the new NotImplementedError. diff_diff/synthetic_did.py:L1065-L1067, diff_diff/synthetic_did.py:L1131-L1143, diff_diff/practitioner.py:L564-L566
  • I could not run the touched tests here because this environment is missing numpy and pytest.

Methodology

  • No material findings. The changed estimator code matches the documented contracts in REGISTRY.md, and the previously-blocking full-census jackknife bug appears fixed. diff_diff/synthetic_did.py:L827-L895, diff_diff/synthetic_did.py:L2320-L2334, docs/methodology/REGISTRY.md:L1582-L1621

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • Severity: P3. Impact: full-design survey jackknife results now set _loo_unit_ids/_loo_roles even though get_loo_effects_df() is intentionally unavailable on the "psu" path, so internal/canned guidance that uses _loo_unit_ids is not None as the availability check will still call the accessor and raise NotImplementedError. Concrete fix: only populate _loo_unit_ids/_loo_roles when _loo_granularity == "unit", or update the guidance to gate on _loo_granularity != "psu" / catch NotImplementedError. diff_diff/synthetic_did.py:L1065-L1067, diff_diff/synthetic_did.py:L1131-L1143, diff_diff/practitioner.py:L564-L566

Tech Debt

  • No material findings. Removing the old SDID survey placebo/jackknife TODO entry is consistent with the new shipped capability.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: the benchmark helper docstring in _fit_one() still says survey DGPs route through the bootstrap survey path only, which is stale after this PR added jackknife full-design support. Concrete fix: update the docstring to mention the new survey jackknife path or make it method-agnostic. benchmarks/python/coverage_sdid.py:L299-L303
  • I could not execute tests/test_survey_phase5.py or tests/test_methodology_sdid.py locally because numpy and pytest are unavailable in this environment.

…h harness docstring

P3 (Maintainability — survey jackknife still populated _loo_unit_ids):
``fit()`` was unconditionally setting ``_loo_unit_ids`` / ``_loo_roles``
on every jackknife fit, including full-design survey fits where the
underlying replicates are PSU-level and ``get_loo_effects_df()`` now
raises ``NotImplementedError``. Internal / canned guidance keyed off
``_loo_unit_ids is not None`` as the availability check (e.g.,
``practitioner.py``) would still call the accessor on a survey fit and
hit the new raise.

Fix: only populate ``_loo_unit_ids`` / ``_loo_roles`` when
``_loo_granularity == "unit"``; leave them ``None`` on the PSU path so
``_loo_unit_ids is not None`` correctly reports availability.
``_loo_granularity`` is the authoritative accessor gate; the legacy
``_loo_unit_ids`` sentinel now agrees with it.

P3 (Documentation — harness docstring stale):
``coverage_sdid.py::_fit_one`` docstring said "fit() routes [survey
designs] through the bootstrap survey path (PR #352) when
method=='bootstrap'" — stale after the placebo + jackknife full-
design paths landed. Rewrote to describe the three method-specific
survey variance paths (weighted-FW + Rao-Wu bootstrap; stratified-
permutation + weighted-FW placebo; PSU-LOO + stratum-aggregation
jackknife) and mention the Case B-D ValueError failure modes alongside
NotImplementedError.

Verification: 94 passed (no behavior change on the gating fix — it's
a state-gating tightening, not a correctness change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0bcda79bc87ca27fc6fa9b85592eaca8518a6b3b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review items are addressed: PSU-level survey jackknife now gates get_loo_effects_df() via _loo_granularity, no longer leaves _loo_unit_ids populated on the PSU path, and the benchmark/helper text was updated to match. diff_diff/results.py:L1108-L1183, diff_diff/synthetic_did.py:L1131-L1155, benchmarks/python/coverage_sdid.py:L295-L307
  • The new survey jackknife implementation is otherwise aligned with the updated registry: fixed λ, PSU-level LOO, Rust-Rao stratum aggregation, full-census zero-variance handling, and lonely_psu="adjust" rejection are all implemented and documented. diff_diff/synthetic_did.py:L935-L981, diff_diff/synthetic_did.py:L2147-L2505, docs/methodology/REGISTRY.md:L1599-L1621
  • One unmitigated P1 remains on the new survey placebo path: the public contract now says variance_method="placebo" supports full strata/PSU/FPC designs, but the implementation never reads FPC. An fpc= column is enough to switch into the alternate weighted-FW placebo path even though sampling fraction never enters the placebo formula. diff_diff/synthetic_did.py:L744-L800, diff_diff/synthetic_did.py:L1001-L1020, diff_diff/synthetic_did.py:L1770-L1953, docs/methodology/REGISTRY.md:L1552-L1597
  • No additional code-quality, performance, maintainability, security, or tech-debt blockers stood out in the changed files.
  • The touched tests could not be executed locally because this environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity: P1. Impact: SyntheticDiD.fit() treats any fpc declaration as “full design” and routes variance_method="placebo" into _placebo_variance_se_survey(), but that method neither accepts nor uses FPC anywhere. The registry and survey-theory docs, however, advertise placebo support for full strata/PSU/FPC designs. As written, an fpc= column can change the placebo code path without the sampling fraction ever entering the placebo math, which is an undocumented methodology mismatch on a variance method. Concrete fix: either implement and document an FPC-aware placebo derivation, or narrow the placebo contract so fpc does not trigger the full-design placebo path (or raises a targeted error) and update the registry/support matrix accordingly. diff_diff/synthetic_did.py:L744-L800, diff_diff/synthetic_did.py:L1001-L1020, diff_diff/synthetic_did.py:L1770-L1953, docs/methodology/REGISTRY.md:L1552-L1597, docs/methodology/survey-theory.md:L751-L765

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No material findings. Removing the old TODO entry is consistent with the shipped jackknife/placebo capability.

Security

  • No material findings.

Documentation/Tests

  • No separate material findings beyond the P1 above. The new tests do cover jackknife FPC behavior, but I could not execute them locally because the environment is missing numpy, pandas, scipy, and pytest. tests/test_survey_phase5.py:L1025-L1066, tests/test_survey_phase5.py:L1164-L1239

Path to Approval

  1. Decide the placebo+FPC contract. If FPC is not part of the placebo methodology, stop advertising fpc support on that path and make fpc either a documented no-op with parity to the intended route or a targeted front-door error.
  2. If FPC is intended to matter on placebo, thread it through a documented derivation in _placebo_variance_se_survey() rather than using it only as a dispatch trigger.
  3. Add a regression test for the chosen contract: the same panel under SurveyDesign(weights="weight") vs SurveyDesign(weights="weight", fpc="fpc") should either raise, be explicitly parity-matched, or show the intended FPC-adjusted behavior.

…no-op contract

P1 (Methodology — placebo dispatch flipped on FPC alone, but FPC
plays no role in placebo math):
The dispatcher gated placebo's survey-path routing on
``_full_design_survey = strata is not None OR psu is not None OR fpc
is not None``. Adding an ``fpc=`` column to a SurveyDesign therefore
silently switched dispatch from the non-survey placebo path
(unweighted-FW + post-hoc ω composition) to the weighted-FW survey
placebo path — different numerics — even though permutation tests are
conditional on the observed sample (Pesarin 2001 §1.5) and the
sampling fraction never enters Algorithm 4 or its stratified-
permutation survey extension. The reviewer correctly flagged this
as an undocumented methodology mismatch on a public variance method.

Fix:
* Gate ``_placebo_use_survey_path`` on ``strata is not None OR psu
  is not None`` (FPC dropped from the trigger). FPC alone now keeps
  placebo on the non-survey path with no numerical drift relative to
  the no-FPC fit.
* Emit a ``UserWarning`` whenever ``fpc`` is set with
  ``variance_method="placebo"``, regardless of whether ``strata`` or
  ``psu`` are also set, so users get an explicit signal that the
  FPC column is preserved in design metadata but does not enter
  placebo math. Recommends ``variance_method="bootstrap"`` or
  ``"jackknife"`` for FPC participation.
* REGISTRY §SyntheticDiD "Note (survey support matrix)" placebo
  bullet rewritten to spell out the contract: "for designs with
  explicit ``strata`` and/or ``psu`` … FPC is a documented no-op for
  placebo — permutation tests are conditional on the observed sample
  (Pesarin 2001 §1.5)."
* survey-theory.md placebo bullet picks up the same FPC no-op
  language plus the Case B/C/D guard enumeration from R5.

New regression
``test_placebo_fpc_alone_no_op_warns_and_matches_pweight_only``
asserts both contracts: (a) ``UserWarning`` fires when fpc is set
on placebo, (b) SE under ``SurveyDesign(weights, fpc)`` matches SE
under ``SurveyDesign(weights)`` at ``rel=1e-12`` (true no-op, not a
silent dispatch flip introducing weighted-FW drift).

Bootstrap and jackknife paths unchanged — they use FPC legitimately
(Rao-Wu rescaling for bootstrap, ``(1 - f_h)`` factor in the Rust &
Rao 1996 jackknife formula). Only placebo's contract narrows.

Verification: 95 passed (1 new FPC no-op regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cdb42fe8efc58be112a66bbe6d5886c2e413325c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review items are addressed: placebo+FPC is now a documented no-op and no longer changes placebo dispatch, and full-design survey jackknife now blocks unit-level LOO accessors via _loo_granularity. diff_diff/synthetic_did.py:L808-L834, diff_diff/synthetic_did.py:L1165-L1189, diff_diff/results.py:L880-L890, diff_diff/results.py:L1108-L1187, tests/test_survey_phase5.py:L897-L953
  • The new survey jackknife path is otherwise consistent with the updated registry: fixed λ, PSU-level LOO, Rust-Rao stratum aggregation, undefined replicate => NaN, and full-census zero variance => 0.0. diff_diff/synthetic_did.py:L2181-L2515, docs/methodology/REGISTRY.md:L1599-L1623
  • [Newly identified] P1: the new survey placebo feasibility guard is row-count based, but the actual allocator uses survey-weighted pseudo-treated means. A treated-containing stratum can therefore pass the front-door check while having zero usable control survey mass, after which every draw fails inside _placebo_variance_se_survey() and the fit degrades to a generic warning plus SE=0.0 instead of the documented targeted rejection. diff_diff/synthetic_did.py:L867-L929, diff_diff/synthetic_did.py:L1903-L1972, docs/methodology/REGISTRY.md:L1582-L1597
  • The added placebo “stratum contract” regression does not inspect any sampled pseudo-treated indices, so it would not catch an allocator bug after dispatch. tests/test_survey_phase5.py:L760-L778, docs/methodology/REGISTRY.md:L1595-L1597
  • I could not execute the touched tests here because this environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity: P1. Impact: SyntheticDiD.fit() now documents Case B/C/D front-door placebo feasibility checks for the survey-weighted permutation extension, but the guard only counts raw controls/treated rows per stratum. _placebo_variance_se_survey() then computes pseudo-treated means with w_control and swallows ZeroDivisionError/zero-mass failures inside the retry loop. That means a treated-containing stratum with controls but zero usable survey mass is treated as “feasible” up front, then collapses later to n_successful < 2 and SE=0.0 rather than a targeted methodology error. This is a missing assumption check on the new survey placebo path, not a documented deviation. Concrete fix: make the survey-placebo front-door guard weight-aware for the support it actually uses; at minimum, reject treated-containing strata with zero control survey mass before entering the retry loop, and document/test the intended contract for zero-weight treated units. diff_diff/synthetic_did.py:L867-L929, diff_diff/synthetic_did.py:L1903-L1972, docs/methodology/REGISTRY.md:L1582-L1597

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No material findings. Removing the old TODO row is appropriate once the survey-placebo support edge above is fixed.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: test_placebo_full_design_pseudo_treated_stays_within_treated_strata only verifies that _placebo_variance_se_survey() received the expected strata arrays; it never records any sampled pseudo_treated_idx, so it does not actually test the per-draw within-stratum permutation contract described in REGISTRY. Concrete fix: capture sampled pseudo-treated indices (or sampled strata per draw) inside the placebo survey loop and assert every sampled control comes from a treated-containing stratum. tests/test_survey_phase5.py:L760-L778, docs/methodology/REGISTRY.md:L1595-L1597
  • I could not run the touched tests locally because numpy, pandas, scipy, and pytest are not installed.

Path to Approval

  1. Extend the survey-placebo front-door feasibility guard so treated-containing strata with zero control survey mass raise a targeted ValueError before _placebo_variance_se_survey() enters its retry loop. diff_diff/synthetic_did.py:L867-L929, diff_diff/synthetic_did.py:L1903-L1972
  2. Add a regression where a treated-containing stratum has raw controls but all of those controls have weight=0; assert a targeted fit-time error rather than a late warning plus SE=0.0. tests/test_survey_phase5.py:L780-L953
  3. Strengthen the allocator regression to observe actual pseudo-treated draws, not just dispatch arguments. tests/test_survey_phase5.py:L760-L778

…raw allocator regression

P1 (Methodology — placebo Case E weight-aware feasibility):
The Case B / C / D fit-time guards count raw rows per stratum, but
the allocator computes pseudo-treated means as
``np.average(Y[:, pseudo_treated_idx], weights=w_control[pseudo_treated_idx])``.
A treated-containing stratum can pass row-count guards while having
fewer positive-weight controls than treated units — every draw can
then pick a pseudo-treated subset whose weights all sum to zero
(``ZeroDivisionError`` inside np.average), the per-draw retry loop
swallows the failure as a generic ``n_successful=0`` warning, and
the fit reports ``SE=0.0`` instead of a targeted methodology error.

Fix: add a Case E front-door guard that rejects any treated-
containing stratum with ``n_positive_weight_controls_h <
n_treated_h``. Ordered after Case B/C (row-count failures) so the
existing row-count error messages still fire when relevant; Case E
catches the remaining "rows present, weights insufficient" gap.

New regression
``test_placebo_full_design_raises_on_zero_weight_controls_in_stratum``:
zeros out ``weight`` for all stratum-0 controls (units 5-14) on
``sdid_survey_data_full_design`` (which has 10 stratum-0 controls,
5 treated). Row-count guards pass (10 ≥ 5) but Case E now rejects
with the targeted "at least n_treated controls with positive
survey weight" message instead of the late ``SE=0.0`` warning.

REGISTRY enumeration updated to four cases (B, C, E, D) with the
weight-aware language; Validation bullet bumped to reflect the new
regression.

P3 (Documentation/Tests — placebo allocator regression too weak):
``test_placebo_full_design_pseudo_treated_stays_within_treated_strata``
previously only recorded the dispatch arguments to
``_placebo_variance_se_survey``; it did not observe any actual
pseudo-treated indices, so it would not catch an allocator bug
inside the per-draw loop.

Fix: install a recording wrapper around ``np.random.default_rng``
that intercepts every ``rng.choice`` call inside the per-draw loop
and records the sampled control indices' stratum memberships. Assert
every recorded draw's sampled stratum ⊆ treated-strata set across
all 30 replications, directly verifying the within-stratum
permutation contract from REGISTRY.

Verification: 96 passed (2 new regressions; existing Case B/C/D
guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3399a711f7ef307f106d03ce0b8cae8677d7840d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blockers appear addressed: the placebo survey path now front-door checks positive-weight control support, the allocator regression records actual sampled pseudo-treated draws, and full-design survey jackknife no longer exposes unit-level LOO diagnostics for PSU-level replicates. diff_diff/synthetic_did.py:L867-L935, tests/test_survey_phase5.py:L744-L880, diff_diff/results.py:L1170-L1187, diff_diff/synthetic_did.py:L1191-L1215
  • P1: the new placebo “FPC is a no-op” contract is not fully propagated. fit() still runs the implicit-PSU FPC validator before variance-method dispatch, so variance_method="placebo" can raise on low fpc values even though REGISTRY now says placebo ignores FPC. diff_diff/synthetic_did.py:L442-L490, diff_diff/synthetic_did.py:L792-L834, docs/methodology/REGISTRY.md:L1564-L1566
  • The new survey jackknife path otherwise matches the updated methodology note: PSU-level LOO, fixed λ, Rust-Rao stratum aggregation, FPC factor, and NaN on undefined delete-one replicates. diff_diff/synthetic_did.py:L2207-L2558, docs/methodology/REGISTRY.md:L1600-L1624
  • The new placebo FPC regression only covers an oversized valid fpc, so it misses the still-failing low-fpc, no-psu interaction above. tests/test_survey_phase5.py:L976-L1025, diff_diff/synthetic_did.py:L458-L490
  • Local execution was not possible here because numpy, pandas, scipy, and pytest are not installed.

Methodology

  • Severity: P1. Impact: SyntheticDiD.fit() still applies the bootstrap/jackknife-only implicit-PSU FPC check before it knows whether placebo will run. That means a placebo fit with SurveyDesign(..., fpc=...) and no explicit psu can fail on fpc < n_units / fpc < n_h_units, even though the new registry note explicitly says FPC is a placebo no-op and should only trigger a warning, not alter or block the Algorithm 4 path. This is an undocumented methodology/parameter-interaction mismatch in a newly added survey branch. Concrete fix: bypass the implicit-PSU FPC validator for variance_method="placebo" (or gate it to bootstrap/jackknife only), then let placebo continue to the documented no-op warning path. Update the fit docstring accordingly. diff_diff/synthetic_did.py:L284-L305, diff_diff/synthetic_did.py:L442-L490, diff_diff/synthetic_did.py:L792-L834, docs/methodology/REGISTRY.md:L1564-L1566

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No material findings. Removing the old SDID survey-gap TODO is reasonable once the P1 above is fixed.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: the new placebo no-op regression only exercises fpc_col = 1000.0, so it never covers the low-fpc, no-psu case where the bootstrap-specific validator still leaks into placebo. That left the P1 above unguarded. Concrete fix: add a regression with variance_method="placebo" and SurveyDesign(weights="weight", fpc="fpc_col") where fpc_col is below the implicit-PSU threshold, and assert warning + parity to the pweight-only placebo fit rather than ValueError. A stratified no-psu variant would cover both branches of the validator. tests/test_survey_phase5.py:L976-L1025, diff_diff/synthetic_did.py:L458-L490
  • Local execution was not possible because numpy, pandas, scipy, and pytest are not installed.

Path to Approval

  1. Restrict the implicit-PSU FPC validation in diff_diff/synthetic_did.py:L458-L495 to variance methods that actually use FPC in variance computation (bootstrap and jackknife), so placebo reaches the documented no-op warning path.
  2. Add a placebo regression for low-fpc, no-psu designs in tests/test_survey_phase5.py and assert warning + numerical parity with the pweight-only placebo fit.

…trap/jackknife only

P1 (Methodology — implicit-PSU FPC validator leaked into placebo):
PR #355 R8 P1 added a fit-time validator that rejects ``psu=None``
+ ``fpc < n_units`` designs, because Rao-Wu bootstrap treats each
unit as its own PSU and would fail mid-draw with the bootstrap loop
swallowing the error as a generic exhaustion message. The validator
ran unconditionally on every survey fit.

After R8 documented FPC as a placebo no-op (Pesarin 2001 §1.5 —
permutation tests condition on the observed sample), this validator
became inconsistent: a placebo fit with low FPC and no explicit
``psu`` would still raise a "FPC must be ≥ n_units" error for a
constraint that doesn't apply to the placebo math.

Fix: gate the implicit-PSU FPC validator on
``self.variance_method in ("bootstrap", "jackknife")``. Both methods
genuinely consume FPC (Rao-Wu rescaling for bootstrap, Rust & Rao
``(1 - f_h)`` factor for jackknife). Placebo proceeds to the
documented no-op warning path regardless of FPC value.

New regression
``test_placebo_low_fpc_no_psu_warns_no_validator_block``:
sets ``fpc_col = 5`` (well below n_units=30) with no PSU. Asserts
(a) placebo fit succeeds, (b) emits the documented FPC-no-op
``UserWarning``, (c) SE matches the no-FPC pweight-only fit at
``rel=1e-12``, AND (d) bootstrap on the same low-FPC design still
raises the validator error (gating preserves bootstrap/jackknife
behavior — only placebo's FPC contract changes).

Verification: 97 passed (1 new low-FPC placebo regression; existing
bootstrap/jackknife FPC validation regressions still fire on their
fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 312f78f73ca537c6494fc77b01a5d3ce3d29a6d4


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior blocker from the last review is fixed for the no-psu case: placebo now skips the fit-time implicit-PSU FPC validator, and there is a regression for low-FPC/no-PSU placebo.
  • P1 [Newly identified]: the new “fpc is a placebo no-op” contract is still incomplete. SurveyDesign.resolve() will still reject low FPC when psu is explicitly declared, so some placebo fits fail before they ever reach the new warning/no-op path.
  • P1: the new survey placebo guard still misses an effective single-support case when a treated stratum has only one positive-weight control plus additional zero-weight controls. In that shape, the retry loop can return SE=0.0 on a degenerate positive-mass null instead of rejecting it.
  • Aside from those survey-extension gaps, the new branches preserve the paper-level split correctly: Algorithm 3 is a fixed-weight jackknife, while Algorithm 4 recomputes SDID on placebo allocations. (nber.org)
  • Local execution was not possible because numpy, pandas, scipy, and pytest are not installed.

Methodology

  • Severity: P1 [Newly identified]. Impact: the documented placebo contract says fpc= is a no-op and should only warn, but that is only true when psu is absent. With explicit psu, SurveyDesign.resolve() still raises on fpc < n_psu, so variance_method="placebo" can fail on a design element that the new registry note says should not participate in placebo variance at all. This is the same parameter-interaction problem as the prior blocker, just one layer earlier in resolution. Refs: diff_diff/survey.py:L349-L368, diff_diff/synthetic_did.py:L825-L843, docs/methodology/REGISTRY.md:L1564-L1566, tests/test_survey_phase5.py:L976-L1108. Concrete fix: make placebo ignore FPC before/while resolving the survey design as well, not only in the later fit-time validator path; then add regressions for low-FPC placebo with explicit psu, both stratified and unstratified.

  • Severity: P1. Impact: Case D degeneracy is only checked on raw control counts, not on the effective positive-mass support actually used by the survey placebo estimator. If n_treated_h=1, a treated stratum has one positive-weight control, and the rest of its controls have zero weight, the current guards pass (n_c_h > n_t_h, n_c_h_positive == n_t_h), zero-mass draws are silently skipped in the loop, and every successful draw reuses the same positive-mass pseudo-treated allocation. That yields a degenerate null and can return SE=0.0 instead of the targeted front-door failure that Case D is meant to enforce. Refs: diff_diff/synthetic_did.py:L907-L944, diff_diff/synthetic_did.py:L1919-L2021, docs/methodology/REGISTRY.md:L1588-L1594. Concrete fix: define degeneration on effective positive-mass support as well as raw row counts, or sample placebo treated units only from positive-weight controls; add a regression with one positive-weight control and additional zero-weight controls in a treated stratum and assert a targeted failure rather than SE=0.0.

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No separate deferrable findings. The two issues above are correctness/methodology blockers and are not TODO-eligible.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests cover the previously reviewed no-psu FPC leak and many jackknife survey edges, but they still do not cover the two remaining placebo gaps above: low-FPC with explicit psu, and effective single-support degeneration with zero-weight controls. Refs: tests/test_survey_phase5.py:L844-L1108. Concrete fix: add one regression for each case and keep the expected behavior aligned with the registry note (warning + success for FPC no-op; targeted failure for degenerate placebo support).

Path to Approval

  1. Make fpc= a true placebo no-op even when psu is explicitly declared, and add regressions for low-FPC placebo with explicit psu (with and without strata).
  2. Extend the placebo feasibility guard to catch effective positive-mass single-support cases, or restrict pseudo-treated sampling to positive-weight controls, and add a regression that currently collapses to SE=0.0 despite having only one successful positive-mass allocation.

…ctive-support guard

P1 #1 (FPC validator in SurveyDesign.resolve fires on placebo with
explicit psu):
The R10 fix gated the in-fit implicit-PSU FPC validator on
bootstrap/jackknife only, but ``SurveyDesign.resolve()`` itself
enforces ``FPC >= n_PSU`` design-validity (survey.py:349-368) before
``synthetic_did.fit()`` even sees the resolved object. So a placebo
fit with explicit ``psu`` and low ``fpc`` would still raise — same
parameter-interaction problem one layer earlier in resolution.

Fix: when ``variance_method == "placebo"`` and
``survey_design.fpc is not None``, construct an FPC-stripped copy of
the SurveyDesign (``dataclasses.replace(survey_design, fpc=None)``)
BEFORE calling ``_resolve_survey_for_fit``. Emit the FPC no-op
``UserWarning`` at the same time. The original ``survey_design``
object is preserved (caller's reference unchanged); the resolved
unit-level survey design carries no FPC on placebo, so the in-fit
validators (and the downstream FPC-related dispatch flags) all
correctly skip FPC handling.

The duplicate downstream FPC no-op warning (added in R8 keyed on
``resolved_survey_unit.fpc``) becomes unreachable on placebo and is
removed.

New regression
``test_placebo_low_fpc_with_explicit_psu_skips_resolve_validator``:
asserts (a) placebo with explicit psu + ``fpc < n_PSU`` succeeds
+ emits no-op warning, (b) SE matches the no-FPC fit at ``rel=1e-12``,
(c) bootstrap on the same low-FPC design still raises
``"FPC (2.0) is less than the number of PSUs"`` from
``SurveyDesign.resolve()`` — validator-skip is correctly variance-
method-gated.

P1 #2 (Case D missed effective single-support):
The Case D guard for placebo degeneracy keyed on raw control counts
(``n_c_h > n_t_h`` for at least one stratum). It missed the case
where ``n_c_h_positive < 2`` for every treated stratum: rows allow
multiple subsets, but every successful pseudo-treated mean reduces
to the unique positive-weight control's outcome (zero-weight
cohabitants contribute 0 to numerator and denominator, R11 P1).
The placebo null collapses to a single point and SE = FP noise.

Fix: extend the non-degeneracy invariant to require **both**
``n_c_h > n_t_h`` AND ``n_c_h_positive >= 2`` for at least one
treated stratum. The classical Case D shape (raw exact-count
``n_c_h == n_t_h``) and the new "effective single-support" shape
(positive-weight controls < 2 even with extra zero-weight rows) both
trigger Case D. Updated the Case D error message to enumerate
``n_c_positive`` alongside ``n_c`` / ``n_t`` per stratum.

New regression
``test_placebo_full_design_raises_on_effective_single_support``:
constructs a fixture with 1 treated unit + 1 positive-weight
control + 9 zero-weight controls in stratum 0; raw guards (B/C/E)
pass but Case D fires with the new "single distinct positive-mass
pseudo-treated mean" message.

Updated existing
``test_placebo_full_design_raises_on_exact_count_stratum`` regex
to match the new message (same Case D path, slightly different
wording).

REGISTRY §SyntheticDiD Case enumeration updated: Case D now
documents both the classical (``n_c == n_t``) and effective single-
support (``n_c_positive < 2``) shapes, with the combined non-
degeneracy invariant.

Verification: 98 passed (2 new regressions; existing Case B/C/E/D-
classical guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a17c8a04524a97d83517fb1d8716ae921847b054


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior placebo issues from the last review look addressed: placebo now drops fpc before survey resolution, and the effective single-support zero-weight-control case is front-door rejected.
  • P1 [Newly identified]: the new survey jackknife mishandles lonely_psu="certainty" when every stratum is singleton. It falls through the “all strata skipped” branch and returns SE=NaN, but the project’s survey semantics and docs define certainty PSUs as zero-variance contributors, so this case should resolve to SE=0.0, not undefined.
  • Aside from that, I did not find another unmitigated methodology mismatch in the changed SDID survey placebo/jackknife logic.
  • No new inline inference / partial-NaN anti-patterns were introduced; the result path still goes through safe_inference().
  • Local execution was not possible here because pytest and numpy are not installed in the environment.

Methodology

  • Severity: P1 [Newly identified]. Impact: SyntheticDiD._jackknife_se_survey() accepts lonely_psu="certainty" as a supported mode, but singleton strata are always just skipped, and if every stratum is singleton the function returns SE=NaN via the “every stratum was skipped” path. That contradicts the library’s own survey contract for certainty PSUs, which says singleton certainty strata contribute zero variance, and it also contradicts the new SDID registry text claiming "remove" and "certainty" both treat singleton strata as zero contributors. This is an incorrect variance/SE surface for a supported survey mode. Concrete fix: in the survey jackknife path, distinguish "certainty" from "remove" for n_h < 2 strata and treat singleton certainty strata as legitimate zero-variance contributors; if all strata are certainty singletons, return SE=0.0 and let safe_inference() produce NaN inference fields. Refs: diff_diff/synthetic_did.py:L1048-L1071, diff_diff/synthetic_did.py:L2422-L2426, diff_diff/synthetic_did.py:L2595-L2605, docs/methodology/survey-theory.md:L526-L535, diff_diff/survey.py:L1519-L1525, diff_diff/survey.py:L1579-L1583, tests/test_survey.py:L2401-L2429, docs/methodology/REGISTRY.md:L1612-L1616.

Paper-level cross-checks otherwise look coherent: the jackknife path still treats learned SDID weights as fixed during variance estimation, which matches the paper’s large-sample inference framing, and the placebo path remains permutation-like rather than bootstrap-like; the survey-specific stratified-placebo / PSU-jackknife extension is a local adaptation documented in REGISTRY.md. (nber.org)

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • No material findings beyond the P1 above.

Tech Debt

  • No separate deferrable findings. The remaining issue is a correctness/methodology problem, not TODO-grade debt.

Security

  • No material findings.

Documentation/Tests

  • Severity: P3. Impact: the new jackknife tests cover "certainty" only on a non-singleton fixture, so they do not lock in the all-certainty edge case that the survey stack already treats as zero variance. That allowed the P1 above to remain undetected. Concrete fix: add a full-design SDID jackknife regression mirroring tests/test_survey.py::test_all_certainty_psu_zero_vcov, asserting result.se == 0.0 and NaN inference under lonely_psu="certainty" when every stratum has exactly one PSU. Refs: tests/test_survey_phase5.py:L1615-L1649, tests/test_survey.py:L2401-L2429.

Path to Approval

  1. Fix _jackknife_se_survey() so lonely_psu="certainty" treats singleton strata as zero-variance contributors instead of routing all-singleton designs to the “all strata skipped” NaN branch.
  2. Add a regression for full-design SDID jackknife with all singleton certainty strata and assert SE=0.0 plus NaN t_stat / p_value / CI.

…ve' on jackknife survey

P1 (Methodology — all-singleton certainty design returned NaN
instead of zero):
``_jackknife_se_survey`` treated ``lonely_psu="certainty"`` and
``"remove"`` as equivalent — both silently skipped singleton strata
(``n_h < 2``). When every stratum is singleton + certainty, the
fit fell through the "every stratum was skipped → NaN" branch
even though the library's broader survey contract (and
``tests/test_survey.py::test_all_certainty_psu_zero_vcov``) defines
certainty PSUs as zero-variance contributors: an all-certainty
design yields ``vcov = 0``, not NaN.

Fix: thread ``resolved_survey_unit.lonely_psu`` into
``_jackknife_se_survey``. Distinguish the singleton-stratum branch:
* ``"remove"`` (default): silent skip — matches R ``survey::svyjkn``
  lonely-PSU="remove". All-singleton design → ``SE = NaN`` (no
  contributing stratum).
* ``"certainty"``: stratum still adds 0 variance, but is marked
  ``any_stratum_contributed = True`` — explicit zero-variance
  contribution. All-certainty design → ``SE = 0.0`` (legitimate
  zero, downstream ``safe_inference`` propagates NaN to t-stat /
  p-value / CI as the SE=0 contract requires).

New regression
``test_jackknife_full_design_all_certainty_psu_returns_zero_se``:
mirrors ``test_all_certainty_psu_zero_vcov`` from the broader
survey suite. Constructs a 30-stratum 1-PSU/stratum design from
the well-formed jackknife fixture, asserts:
* ``"certainty"`` → ``SE = 0`` exactly, ``t_stat`` and ``p_value``
  NaN via ``safe_inference``;
* ``"remove"`` → ``SE = NaN`` with the "every stratum was skipped"
  warning.

Method signature: ``lonely_psu`` parameter added at the end (after
``fpc_treated``) to keep the existing arg order intact.

REGISTRY ``lonely_psu`` contract updated to spell out the
``"remove"`` vs ``"certainty"`` semantic split for all-singleton
designs.

Verification: 100 passed (1 new all-certainty regression; existing
``test_jackknife_full_design_lonely_psu_certainty_equivalent_to_remove``
still passes — on a fixture with at least one non-singleton
stratum, the two modes still produce the same SE because the
singleton-stratum branch isn't reached).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: fbdba345f2029a9337e98e0f31c8fc8ac25cce59


Overall Assessment

Looks good

Executive Summary

  • The prior re-review blocker is addressed: lonely_psu="certainty" now returns SE=0.0 for all-singleton stratified survey jackknife, matching the survey-core contract and covered by regression tests (diff_diff/synthetic_did.py:L2424-L2441, tests/test_survey_phase5.py:L1615-L1681, tests/test_survey.py:L2401-L2429).
  • I did not find any unmitigated P0/P1 methodology issues in the new SDID survey placebo/jackknife paths. The main survey extensions and documented deviations are now registered and implemented consistently (docs/methodology/REGISTRY.md:L1552-L1628, diff_diff/synthetic_did.py:L1881-L2064, diff_diff/synthetic_did.py:L2258-L2630).
  • Edge-case handling is materially stronger than in the previous review: undefined delete-one replicates now force SE=NaN, full-census FPC returns SE=0, lonely_psu="adjust" is rejected up front, placebo Case B/C/D/E are front-door guarded, and get_loo_effects_df() is blocked only on true PSU-level LOO (tests/test_survey_phase5.py:L1331-L1848, diff_diff/results.py:L1108-L1185).
  • P3 informational: placebo’s pre-resolve FPC drop also bypasses normal fpc= column validation, so a typoed fpc column would be silently ignored on that path. That is numerically harmless because FPC is intentionally a no-op for placebo, but it weakens input validation (diff_diff/synthetic_did.py:L343-L367, diff_diff/survey.py:L326-L329, diff_diff/survey.py:L892-L903).
  • Local execution was not possible here because this environment does not have pytest, numpy, or pandas installed.

Methodology

  • No material findings. The changed estimator logic matches the Methodology Registry’s documented survey extensions, and the previously-raised certainty case is now aligned with the survey-core zero-variance contract (diff_diff/synthetic_did.py:L1045-L1092, diff_diff/synthetic_did.py:L2424-L2441, docs/methodology/REGISTRY.md:L1566-L1620).

Code Quality

  • Severity: P3. Impact: placebo drops fpc before _resolve_survey_for_fit(), so SurveyDesign(fpc="typo") on the placebo path no longer trips SurveyDesign.resolve()’s missing-column check and is just ignored behind the no-op warning. This does not change estimates because placebo documents FPC as a no-op, but it does weaken survey-spec validation. Concrete fix: validate the original fpc column name exists before replacing it with None, or narrow the warning/registry text so it is explicit that placebo ignores fpc without resolving it. Refs: diff_diff/synthetic_did.py:L343-L367, diff_diff/survey.py:L326-L329, diff_diff/survey.py:L892-L903, docs/methodology/REGISTRY.md:L1564-L1564.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No material findings. The previously tracked SDID placebo/jackknife survey gap was removed from TODO.md, and the remaining few-PSU jackknife limitation is documented in REGISTRY.md rather than silently deferred.

Security

  • No material findings.

Documentation/Tests

  • No material findings. The new tests cover the prior lonely_psu="certainty" issue, the new placebo feasibility guards, the full-census FPC case, undefined jackknife replicates, and the get_loo_effects_df() granularity split (tests/test_survey_phase5.py:L1417-L1848, tests/test_methodology_sdid.py:L3495-L3524).

…resolve drop

P3 (Code Quality — typoed FPC column silently ignored on placebo):
The R11 P1 fix dropped ``fpc`` from a copy of ``SurveyDesign`` BEFORE
``_resolve_survey_for_fit()`` to bypass the resolve-time
``FPC >= n_PSU`` validator on placebo. Side effect: the missing-
column check inside ``SurveyDesign.resolve()`` (survey.py:326-329 —
``raise ValueError(f"FPC column '{self.fpc}' not found in data")``)
also no longer ran on placebo. A typoed ``fpc="fpc_typo"`` would be
silently dropped behind the no-op warning, hiding a genuine input-
spec mistake even though the value is mathematically harmless.

Fix: validate the original ``survey_design.fpc`` column name exists
in ``data.columns`` BEFORE replacing it with ``None``. Raise the same
targeted error string ``SurveyDesign.resolve()`` would have raised so
input-spec mistakes still surface on placebo, even when FPC's
*value* doesn't enter the variance computation.

New regression ``test_placebo_typo_fpc_column_still_raises``:
asserts ``ValueError`` with the exact "FPC column 'nonexistent_col'
not found in data" message on a typoed FPC + placebo fit. Existing
low-FPC + placebo regressions still pass (column exists; FPC value
is dropped post-validation as before).

Verification: 101 passed (1 new column-validation regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 087fc94a8619dd96f5402da410739dc1f24e9aa2


Overall Assessment

Looks good

Executive Summary

  • The prior informational placebo/FPC validation issue is fixed: typoed fpc columns now raise before the placebo path drops FPC pre-resolve (diff_diff/synthetic_did.py:L333-L374).
  • No unmitigated P0/P1 methodology defect found in the new full-design SDID survey placebo/jackknife paths. The survey placebo branch keeps Algorithm 4’s re-estimated placebo-on-controls structure, and the survey jackknife branch keeps Algorithm 3’s fixed-weight delete-one logic while layering the documented survey allocators from REGISTRY.md (diff_diff/synthetic_did.py:L862-L1103, L1893-L2062, L2270-L2632; docs/methodology/REGISTRY.md:L1553-L1629). citeturn4view1turn4view2
  • Edge-case handling is materially stronger: placebo now front-doors Case B/C/D/E infeasibilities, and survey jackknife now distinguishes undefined delete-one replicates (SE=NaN) from legitimate zero-variance cases like full-census FPC or all-certainty singleton strata (SE=0) (diff_diff/synthetic_did.py:L901-L1016, L2413-L2632; tests/test_survey_phase5.py:L1529-L1775).
  • The jackknife results surface is now consistent: full-design survey jackknife stores PSU-level replicates and correctly blocks get_loo_effects_df(), while pweight-only jackknife retains unit-level LOO diagnostics (diff_diff/results.py:L885-L890, L1158-L1183; tests/test_survey_phase5.py:L1469-L1527).
  • P3 informational: the committed stratified_survey × jackknife MC row remains highly anti-conservative with only 2 PSUs per stratum, but that limitation is explicitly documented and therefore non-blocking (docs/methodology/REGISTRY.md:L1624-L1649; tests/test_methodology_sdid.py:L3500-L3534).
  • Local execution was not possible here because this environment does not have numpy/pytest installed.

Methodology

  • Severity: P3-informational. Impact: No unmitigated methodology issue found. The only conspicuous numerical weakness in-scope is the n_h=2 jackknife anti-conservatism, and the PR documents it transparently rather than silently shipping it as if it were calibrated. Concrete fix: None required for approval; continuing to steer few-PSU survey users toward variance_method="bootstrap" is sufficient (docs/methodology/REGISTRY.md:L1624-L1649, tests/test_methodology_sdid.py:L3500-L3534).

Code Quality

No material findings.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings. The old SDID placebo/jackknife full-design gap was removed from TODO.md, and I did not find a new untracked correctness debt in the changed code.

Security

No material findings.

Documentation/Tests

  • Severity: P3-informational. Impact: The new tests cover the risky branches well, especially undefined PSU-LOO handling, certainty-vs-remove singleton behavior, full-census FPC zero-variance, placebo Case B/C/D/E guards, and the LOO-granularity split, but I could not run them in this environment (tests/test_survey_phase5.py:L1469-L1775). Concrete fix: None for the PR; runtime confirmation just needs a dependency-complete environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant